Abstract: We present some extensions of Bernstein's inequality for random self-adjoint
operators. The main feature of these results is that they can be applied in the
infinite-dimensional setting. In particular, our inequalities refine the previous results of Hsu, Kakade
and Zhang.



1. Introduction 



Theoretical analysis of many problems, such as low-rank matrix recovery and approximate matrix
multiplication, is built upon exponential bounds for $\mathbb{P}\left(\left\|\sum_i X_i\right\| > t\right)$, where $\{X_i\}$ is a finite
sequence of self-adjoint random matrices and $\|\cdot\|$ is the operator norm. Starting with the pioneering
work of R. Ahlswede and A. Winter [AW02], the moment-generating function technique was used
to produce generalizations of Chernoff, Bernstein and Freedman inequalities to the noncommutative
case; see [Tro11b], [Tro11a], [Oli10] for a thorough treatment and applications. While
sufficient for most problems, these results depend explicitly on the dimension of the matrix, which
does not allow their straightforward application in the infinite-dimensional setting.
The main purpose of this note is to provide a dimension-free version of Bernstein's inequality for a
sequence of independent random matrices as well as for the case of martingale differences. Some
results in this direction were previously obtained in [HKZ11], but with a suboptimal tail. The
trace quantity appearing in our bounds never exceeds the dimension of the matrix, therefore this
result can be seen as a generalization of the finite-dimensional case.

We proceed by stating the main results and giving some applications to estimation of the integral 
operators. 



2. Bernstein's inequality for independent random matrices 

We start with a version of Bernstein's inequality for a sequence of independent self-adjoint
random matrices. Everywhere below, $\|\cdot\|$ stands for the operator norm $\|A\| := \max_i |\lambda_i(A)|$,
where $\lambda_i(A)$ are the eigenvalues of a self-adjoint operator $A$. Moreover, the expectation $\mathbb{E}X$ is taken
elementwise.

Let $\Psi_\sigma(t) := \frac{t^2/2}{\sigma^2 + t/3}$.

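Theorem 2.1. Let $X_1, \ldots, X_n$ be a sequence of $d \times d$ independent self-adjoint random matrices
such that $\mathbb{E}X_i = 0$ and $\|X_i\| \le 1$ a.s.
Denote $\sigma^2 := \left\|\sum_{i=1}^n \mathbb{E}X_i^2\right\|$. Then, for any $t > 0$,

$$\mathbb{P}\left(\left\|\sum_{i=1}^n X_i\right\| > t\right) \le 2\,\frac{\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)}{\sigma^2}\,\exp\left(-\Psi_\sigma(t)\right)\cdot r_\sigma(t),$$

where $r_\sigma(t) = 1 + \frac{6}{t^2\log^2(1 + t/\sigma^2)}$.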

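Remarks: Note that

1. $\frac{1}{\sigma^2}\,\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}X_i^2\right) \le d$ (in fact, if $\sum_{i=1}^n \mathbb{E}X_i^2$ is "approximately low rank", i.e. has many small
eigenvalues, $\frac{1}{\sigma^2}\,\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)$ can be much smaller than $d$).

2. $r_\sigma(t)$ is decreasing, so in the range of $t$ when the inequality becomes nontrivial (e.g., for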
t ^ 1/ ( (o"2)~"'^tr ( ^ I ) ), To- can be replaced by a constant. 



V \ \i=i / / 
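The gap between the trace factor and the dimension can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it reads the bound of Theorem 2.1 as $2\,\frac{\mathrm{tr}(\cdot)}{\sigma^2}\exp(-\Psi_\sigma(t))\,r_\sigma(t)$ with $\Psi_\sigma(t) = \frac{t^2/2}{\sigma^2+t/3}$ and $r_\sigma(t) = 1 + \frac{6}{t^2\log^2(1+t/\sigma^2)}$, and the spectrum of $\sum_i \mathbb{E}X_i^2$ (eigenvalues $100/j^2$, $d = 1000$) is a hypothetical choice, as are the function names.

```python
import math

def psi(t, sigma2):
    """Psi_sigma(t) = (t^2 / 2) / (sigma^2 + t / 3)."""
    return (t * t / 2.0) / (sigma2 + t / 3.0)

def r(t, sigma2):
    """Correction factor r_sigma(t) = 1 + 6 / (t^2 log^2(1 + t / sigma^2))."""
    return 1.0 + 6.0 / (t * t * math.log(1.0 + t / sigma2) ** 2)

def bernstein_tail(t, eigenvalues):
    """Assumed right-hand side of Theorem 2.1, given the eigenvalues of sum_i E X_i^2."""
    sigma2 = max(eigenvalues)                    # operator norm of a nonnegative definite matrix
    effective_rank = sum(eigenvalues) / sigma2   # trace factor, never exceeds d
    return 2.0 * effective_rank * math.exp(-psi(t, sigma2)) * r(t, sigma2)

# Hypothetical "approximately low rank" spectrum: d = 1000 eigenvalues 100 / j^2.
d = 1000
eigs = [100.0 / (j * j) for j in range(1, d + 1)]
effective_rank = sum(eigs) / max(eigs)

print("dimension d:", d)
print("trace factor tr / sigma^2:", effective_rank)
print("bound at t = 50:", bernstein_tail(50.0, eigs))
```

For this spectrum the trace factor stays below 2 while $d = 1000$, so a bound with a multiplicative factor $d$ would be larger by almost three orders of magnitude.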

Proof. The proof follows the lines of [Tro11b], where the key role is played by Lieb's concavity
theorem [Lie73]:

Theorem (Lieb). Given a fixed self-adjoint matrix $H$, the function

$$A \mapsto \mathrm{tr}\,\exp(H + \log A)$$

is concave on the positive definite cone.

In [Tro11b], section 4.8, the advantages over the classic method of Ahlswede and Winter based
on the Golden-Thompson inequality are discussed.

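Let $\phi(\theta) := e^\theta - \theta - 1$. Note that $\phi$ is nonnegative and increasing on $(0,\infty)$. Denote $S_n := \sum_{i=1}^n X_i$
and note that $\sigma^2 = \|\mathbb{E}S_n^2\|$. First, we reduce the bounds on probability to the bounds on moment
generating functions through a chain of simple inequalities. Let $\theta > 0$; we have

$$\mathbb{P}\left(\lambda_{\max}(S_n) > t\right) = \mathbb{P}\left(\lambda_{\max}(\theta S_n) > \theta t\right) \le \mathbb{P}\left(\lambda_{\max}(\phi(\theta S_n)) > \phi(\theta t)\right) \le \frac{\mathbb{E}\,\mathrm{tr}\,\phi(\theta S_n)}{\phi(\theta t)}, \tag{2.1}$$

where the last step is Markov's inequality combined with $\lambda_{\max}(B) \le \mathrm{tr}\,B$ for nonnegative definite $B$.

The following semidefinite relation is straightforward:

$$\log \mathbb{E}e^{\theta X_i} \preceq \phi(\theta)\,\mathbb{E}X_i^2. \tag{2.2}$$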
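Indeed, writing the series expansion for $e^{\theta X_i}$ and using that $\mathbb{E}X_i = 0$, we obtain

$$\mathbb{E}e^{\theta X_i} = I_d + \mathbb{E}\left[\theta^2 X_i^2\left(\frac{1}{2!} + \ldots + \frac{\theta^{k-2}X_i^{k-2}}{k!} + \ldots\right)\right] \preceq I_d + \phi(\theta)\,\mathbb{E}X_i^2,$$

where in the last step we used the assumption that $\|X_i\| \le 1$ and a monotonicity argument. It
remains to apply the inequality $I + A \preceq e^A$, which holds for self-adjoint $A := \phi(\theta)\,\mathbb{E}X_i^2$, and to take
logarithms, using that the logarithm is operator monotone.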

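Next, since $\mathbb{E}S_n = 0_d$, Lieb's concavity theorem and Jensen's inequality for conditional expectation
imply

$$\begin{aligned}
\mathbb{E}\,\mathrm{tr}\,\phi(\theta S_n) &= \mathbb{E}\,\mathrm{tr}\left(\exp\left(\theta S_{n-1} + \log e^{\theta X_n}\right) - I_d\right) \\
&= \mathbb{E}\,\mathbb{E}\left(\mathrm{tr}\left[\exp\left(\theta S_{n-1} + \log e^{\theta X_n}\right) - I_d\right] \,\middle|\, X_1, \ldots, X_{n-1}\right) \\
&\le \mathbb{E}\,\mathrm{tr}\left(\exp\left(\theta S_{n-1} + \log \mathbb{E}e^{\theta X_n}\right) - I_d\right).
\end{aligned}$$

Iterating this argument, we get

$$\mathbb{E}\,\mathrm{tr}\,\phi(\theta S_n) \le \mathrm{tr}\left[\exp\left(\sum_{i=1}^n \log \mathbb{E}e^{\theta X_i}\right) - I_d\right],$$

which together with (2.2) gives

$$\mathbb{E}\,\mathrm{tr}\,\phi(\theta S_n) \le \mathrm{tr}\left(\exp\left(\phi(\theta)\,\mathbb{E}S_n^2\right) - I_d\right). \tag{2.3}$$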

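Note that

$$\begin{aligned}
\exp\left(\phi(\theta)\,\mathbb{E}S_n^2\right) - I_d &= \phi(\theta)\,\mathbb{E}S_n^2\left(I_d + \frac{1}{2!}\,\phi(\theta)\,\mathbb{E}S_n^2 + \ldots + \frac{1}{k!}\left(\phi(\theta)\,\mathbb{E}S_n^2\right)^{k-1} + \ldots\right) \\
&\preceq \phi(\theta)\,\mathbb{E}S_n^2\left(1 + \frac{1}{2!}\left\|\phi(\theta)\,\mathbb{E}S_n^2\right\| + \ldots + \frac{1}{k!}\left\|\phi(\theta)\,\mathbb{E}S_n^2\right\|^{k-1} + \ldots\right) \\
&= \frac{\mathbb{E}S_n^2}{\sigma^2}\left(\exp\left(\phi(\theta)\sigma^2\right) - 1\right) \preceq \frac{\mathbb{E}S_n^2}{\sigma^2}\,\exp\left(\phi(\theta)\sigma^2\right).
\end{aligned} \tag{2.4}$$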



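Combining (2.4) with (2.1), we get

$$\mathbb{P}\left(\lambda_{\max}(S_n) > t\right) \le \frac{\mathrm{tr}\left(\mathbb{E}S_n^2\right)}{\sigma^2}\,\exp\left(\phi(\theta)\sigma^2\right)\exp(-\theta t)\cdot\frac{e^{\theta t}}{\phi(\theta t)}.$$

Note that for $y > 0$,

$$\frac{e^y}{e^y - y - 1} = 1 + \frac{1 + y}{e^y - y - 1} \le 1 + \frac{1 + y}{y^2/2 + y^3/6} \le 1 + \frac{6}{y^2}. \tag{2.5}$$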



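Choose $\theta_* := \log\left(1 + \frac{t}{\sigma^2}\right)$ to minimize $\exp\left(\phi(\theta)\sigma^2 - \theta t\right)$. Together with the well-known inequality

$$(1 + y)\log(1 + y) - y \ge \frac{y^2/2}{1 + y/3}, \quad y \ge 0, \tag{2.6}$$

and (2.5), this concludes the proof. Indeed, $\phi(\theta_*)\sigma^2 - \theta_* t = -\sigma^2\left[(1+y)\log(1+y) - y\right]$ with $y = t/\sigma^2$, so (2.6)
gives $\exp\left(\phi(\theta_*)\sigma^2 - \theta_* t\right) \le \exp\left(-\Psi_\sigma(t)\right)$, while (2.5) applied with $y = \theta_* t$ produces the factor $r_\sigma(t)$.
It remains to repeat the argument with $X_i$'s replaced by $(-X_i)$'s to obtain a bound for the operator norm. □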
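The two scalar inequalities used at the end of the proof, and the choice of $\theta_*$, are easy to check numerically. The following sketch assumes the readings $e^y/(e^y - y - 1) \le 1 + 6/y^2$ for (2.5) and $(1+y)\log(1+y) - y \ge \frac{y^2/2}{1+y/3}$ for (2.6); the grid and the values $t = 3$, $\sigma^2 = 2$ are arbitrary illustrative choices, and a finite grid check is of course not a proof.

```python
import math

def phi(theta):
    """phi(theta) = e^theta - theta - 1."""
    return math.exp(theta) - theta - 1.0

def psi(t, sigma2):
    """Psi_sigma(t) = (t^2 / 2) / (sigma^2 + t / 3)."""
    return (t * t / 2.0) / (sigma2 + t / 3.0)

def exponent(theta, t, sigma2):
    """The quantity phi(theta) * sigma^2 - theta * t that is minimized over theta."""
    return phi(theta) * sigma2 - theta * t

def holds_2_5(y):
    """Assumed form of (2.5): e^y / (e^y - y - 1) <= 1 + 6 / y^2."""
    return math.exp(y) / phi(y) <= 1.0 + 6.0 / (y * y)

def holds_2_6(y):
    """Assumed form of (2.6): (1 + y) log(1 + y) - y >= (y^2 / 2) / (1 + y / 3)."""
    return (1.0 + y) * math.log1p(y) - y >= (y * y / 2.0) / (1.0 + y / 3.0)

grid = [0.05 * k for k in range(1, 1001)]   # y ranging over (0, 50]
print("(2.5) holds on grid:", all(holds_2_5(y) for y in grid))
print("(2.6) holds on grid:", all(holds_2_6(y) for y in grid))

t, sigma2 = 3.0, 2.0
theta_star = math.log(1.0 + t / sigma2)
print("exponent at theta_*:", exponent(theta_star, t, sigma2))
print("-Psi_sigma(t):      ", -psi(t, sigma2))
```

Since $\theta \mapsto \phi(\theta)\sigma^2 - \theta t$ is convex with derivative $(e^\theta - 1)\sigma^2 - t$, the stationary point $\theta_*$ is its global minimizer, which is what the comparison of the last two printed values reflects.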



3. Bernstein's inequality for the sums of martingale differences 

Our next goal is to obtain a concentration inequality for the sums of matrix-valued martingale
differences. Although we get a slightly weaker bound compared to the previous inequality, it still
improves the multiplicative dimension factor. For $t \in \mathbb{R}$, define $p(t) := \min(-t, 1)$. Note that

1. $p(t)$ is concave;

2. $g(t) := e^t - 1 + p(t)$ is non-negative for all $t$ and increasing for $t > 0$.

Recall the following useful result:

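Proposition 3.1 (Peierls inequality). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex function and $\{u_1, \ldots, u_n\}$
any orthonormal basis of $\mathbb{C}^n$. For any self-adjoint $A \in \mathbb{C}^{n \times n}$,

$$\sum_{i=1}^n f\left(\langle u_i, A u_i\rangle\right) \le \mathrm{tr}\,f(A).$$

An immediate corollary of this fact is that $A \mapsto \mathrm{tr}\,f(A)$ is convex for a convex real-valued $f$
and self-adjoint $A$: to show that

$$\mathrm{tr}\,f\left(\frac{A + B}{2}\right) \le \frac{1}{2}\left(\mathrm{tr}\,f(A) + \mathrm{tr}\,f(B)\right),$$

it is enough to apply Peierls inequality to the orthonormal system given by the eigenvectors of
$(A + B)$.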

In particular, since $p(t)$ is concave, it follows from Jensen's inequality that for any random
self-adjoint matrix $Y$

$$\mathbb{E}\,\mathrm{tr}\,p(Y) \le \mathrm{tr}\,p(\mathbb{E}Y). \tag{3.1}$$

Everywhere below, $\mathbb{E}_i[\,\cdot\,]$ stands for the conditional expectation $\mathbb{E}[\,\cdot \mid X_1, \ldots, X_i]$. We are ready to
prove the main result of this section:

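Theorem 3.1. Let $X_1, \ldots, X_n$ be a sequence of martingale differences with values in the set of
$d \times d$ self-adjoint matrices and such that $\|X_i\| \le 1$ a.s.
Denote $W_n := \sum_{i=1}^n \mathbb{E}_{i-1}X_i^2$. Then, for any $t > 0$,

$$\mathbb{P}\left(\left\|\sum_{i=1}^n X_i\right\| > t,\ \lambda_{\max}(W_n) \le \sigma^2\right) \le 2\,\mathrm{tr}\left[p\left(-\frac{t}{\sigma^2}\,\mathbb{E}W_n\right)\right]\exp\left[-\Psi_\sigma(t)\right]\cdot r_\sigma(t),$$

where $r_\sigma(t) = 1 + \frac{6}{\Psi_\sigma^2(t)}$.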
Remarks: Note that

1. $\mathrm{tr}\,p\left(-\frac{t}{\sigma^2}\,\mathbb{E}W_n\right) \le d$ for all $t > 0$;

2. $r_\sigma(t)$ is decreasing, so whenever $\Psi_\sigma(t) \ge 1$, $r_\sigma(t)$ can be replaced by a constant.

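Proof. Recall that $\phi(\theta) = e^\theta - \theta - 1$. Denote $S_n := \sum_{i=1}^n X_i$.

Let $\theta$ be such that $\theta t - \phi(\theta)\sigma^2 > 0$ and define an event $E$ by

$$E := \left\{\lambda_{\max}\left(\theta S_n - \phi(\theta)W_n\right) > \theta t - \phi(\theta)\sigma^2\right\}.$$

Note that triangle inequality implies

$$E \supseteq \left\{\lambda_{\max}(S_n) > t,\ \lambda_{\max}(W_n) \le \sigma^2\right\}.$$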

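We proceed by bounding $\mathbb{P}(E)$:

$$\begin{aligned}
\mathbb{P}(E) &\le \mathbb{P}\left(\lambda_{\max}\left(g(\theta S_n - \phi(\theta)W_n)\right) > g\left(\theta t - \phi(\theta)\sigma^2\right)\right) \\
&\le \mathrm{tr}\,\mathbb{E}\left(g(\theta S_n - \phi(\theta)W_n)\right)\,\exp\left(\phi(\theta)\sigma^2 - \theta t\right)\cdot\frac{\exp\left(\theta t - \phi(\theta)\sigma^2\right)}{g\left(\theta t - \phi(\theta)\sigma^2\right)}.
\end{aligned} \tag{3.2}$$

The second term in the product, $\exp\left(\phi(\theta)\sigma^2 - \theta t\right)$, is minimized for $\theta_* := \log\left(1 + \frac{t}{\sigma^2}\right)$ and

$$\exp\left(\phi(\theta_*)\sigma^2 - \theta_* t\right) \le \exp\left(-\Psi_\sigma(t)\right) \tag{3.3}$$

by (2.6). To bound the first term in the product (3.2), note that by Lieb's theorem

$$Y_k := \mathrm{tr}\,\exp\left(\theta S_k - \phi(\theta)W_k\right)$$

is a supermartingale with initial value $d$ (which can be shown similarly to theorem 2.1, or see [Tro11a]
for details), so that

$$\mathbb{E}\,\mathrm{tr}\,\exp\left(\theta S_n - \phi(\theta)W_n\right) \le d.$$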

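Together with (3.1), this gives

$$\begin{aligned}
\mathrm{tr}\,\mathbb{E}\,g(\theta S_n - \phi(\theta)W_n) &= \mathrm{tr}\,\mathbb{E}\left(\exp(\theta S_n - \phi(\theta)W_n) - I_d + p(\theta S_n - \phi(\theta)W_n)\right) \\
&\le \mathbb{E}\,\mathrm{tr}\,p(\theta S_n - \phi(\theta)W_n) \\
&\le \mathrm{tr}\,p(\theta\,\mathbb{E}S_n - \phi(\theta)\,\mathbb{E}W_n) = \mathrm{tr}\,p(-\phi(\theta)\,\mathbb{E}W_n).
\end{aligned} \tag{3.4}$$

Since $\mathbb{E}W_n$ is nonnegative definite and due to the obvious estimate

$$\phi(\theta) \le e^\theta - 1, \quad \theta \ge 0,$$

applied for $\theta = \theta_*$, bound (3.4) becomes

$$\mathrm{tr}\,\mathbb{E}\,g(\theta_* S_n - \phi(\theta_*)W_n) \le \mathrm{tr}\,p\left(-\frac{t}{\sigma^2}\,\mathbb{E}W_n\right). \tag{3.5}$$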
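Finally, by (2.6)

$$\theta_* t - \phi(\theta_*)\sigma^2 \ge \frac{t^2/2}{\sigma^2 + t/3} \ge 0,$$

and we deduce from (2.5) that

$$\frac{\exp\left(\theta_* t - \phi(\theta_*)\sigma^2\right)}{g\left(\theta_* t - \phi(\theta_*)\sigma^2\right)} \le 1 + \frac{6}{\Psi_\sigma^2(t)}, \tag{3.6}$$

where $\Psi_\sigma(t) = \frac{t^2/2}{\sigma^2 + t/3}$. Combination of bounds (3.3), (3.5), (3.6) concludes the proof.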



□ 



The expression $\mathrm{tr}\left[p\left(-\frac{t}{\sigma^2}\,\mathbb{E}W_n\right)\right]$ which replaces the dimension factor in our bound has a very
simple meaning: acting on the non-negative definite cone, the function $A \mapsto p(-A)$ just truncates
the eigenvalues of $A$ at the unit level. It is easy to see that if the eigenvalues of $\mathbb{E}W_n$ decay
polynomially, i.e., $\lambda_j(\mathbb{E}W_n) \le \frac{\sigma^2}{j^\beta}$, $\beta > 1$, then

$$\mathrm{tr}\,p\left(-\frac{t}{\sigma^2}\,\mathbb{E}W_n\right) \le \min\left(d,\ c\,t^{1/\beta}\right).$$

In particular, this gives an improvement over the bound in [HKZ11].
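The truncation effect can be seen on a small numerical example. The sketch below is illustrative only: it assumes the hypothetical spectrum $\lambda_j = \sigma^2 j^{-\beta}$ with $\beta = 2$, $\sigma^2 = 1$, $d = 2000$, and it uses $c = \beta/(\beta - 1)$, which is one admissible constant obtained by comparing the sum $\sum_j \min(t j^{-\beta}, 1)$ with the integral $\int_0^\infty \min(t x^{-\beta}, 1)\,dx$; this value of $c$ is an assumption of the example and is not taken from the text.

```python
def p(t):
    """p(t) = min(-t, 1), as defined in Section 3."""
    return min(-t, 1.0)

def truncated_trace(t, sigma2, eigenvalues):
    """tr p(-(t / sigma^2) A) for a nonnegative definite A with the given eigenvalues.

    Since p(-x) = min(x, 1), each rescaled eigenvalue is truncated at level 1.
    """
    return sum(p(-(t / sigma2) * lam) for lam in eigenvalues)

# Hypothetical polynomially decaying spectrum: lambda_j = sigma^2 * j^(-beta).
d, sigma2, beta = 2000, 1.0, 2.0
eigs = [sigma2 * j ** (-beta) for j in range(1, d + 1)]
c = beta / (beta - 1.0)   # assumed constant from the integral comparison

for t in (1.0, 100.0, 1e4, 1e8):
    trace_factor = truncated_trace(t, sigma2, eigs)
    reference = min(d, c * t ** (1.0 / beta))
    print(t, trace_factor, reference)
```

For moderate $t$ the trace factor grows like $t^{1/\beta}$ and stays far below $d$; only for very large $t$ does it saturate at the dimension.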

Remark. Clearly, both theorem 2.1 and theorem 3.1 easily extend to the case when $\{X_i\}$ is
a sequence of self-adjoint Hilbert-Schmidt operators $X_i : \mathbb{H} \to \mathbb{H}$ acting on a separable Hilbert
space $\mathbb{H}$, such that $\mathbb{E}X_i = 0$. This can be seen, for example, by showing that Lieb's theorem holds
for this more general case. We provide another direct approach below.

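Let $L_1 \subset L_2 \subset \ldots$ be a nested sequence of finite dimensional subspaces of $\mathbb{H}$ such that $\overline{\bigcup_j L_j} = \mathbb{H}$
and let $P_{L_j}$ be an orthogonal projector on $L_j$. For any fixed $j$, we will apply theorem 2.1 (similarly,
theorem 3.1) to a sequence of finite dimensional operators $\left\{P_{L_j}X_iP_{L_j}\right\}_{i=1}^n$ mapping $L_j$ into itself.

Note that $\left\|\sum_{i=1}^n X_i - \sum_{i=1}^n P_{L_j}X_iP_{L_j}\right\| \to 0$ as $j \to \infty$ almost surely, hence

$$\mathbb{P}\left(\left\|\sum_{i=1}^n X_i\right\| > t\right) \le \liminf_{j \to \infty}\,\mathbb{P}\left(\left\|\sum_{i=1}^n P_{L_j}X_iP_{L_j}\right\| > t\right).$$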



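Note that, since $A \preceq B$ implies $SAS^* \preceq SBS^*$, taking $A = P_{L_j}$, $B = I$ and $S = P_{L_j}X$ gives

$$\left(P_{L_j}XP_{L_j}\right)^2 \preceq P_{L_j}X^2P_{L_j},$$

thus

$$\begin{aligned}
\liminf_{j \to \infty} \frac{\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}\left(P_{L_j}X_iP_{L_j}\right)^2\right)}{\lambda_{\max}\left(\sum_{i=1}^n \mathbb{E}\left(P_{L_j}X_iP_{L_j}\right)^2\right)}
&\le \liminf_{j \to \infty} \frac{\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}\,P_{L_j}X_i^2P_{L_j}\right)}{\lambda_{\max}\left(\sum_{i=1}^n \mathbb{E}\left(P_{L_j}X_iP_{L_j}\right)^2\right)} \\
&\le \frac{\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)}{\limsup_{j \to \infty} \lambda_{\max}\left(\sum_{i=1}^n \mathbb{E}\left(P_{L_j}X_iP_{L_j}\right)^2\right)}
= \frac{\mathrm{tr}\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)}{\left\|\sum_{i=1}^n \mathbb{E}X_i^2\right\|},
\end{aligned}$$

where in the last step we used a simple bound

$$\left\|X^2 - \left(P_{L_j}XP_{L_j}\right)^2\right\| = \left\|X\left(X - P_{L_j}XP_{L_j}\right) + \left(X - P_{L_j}XP_{L_j}\right)P_{L_j}XP_{L_j}\right\| \le 2\|X\|\,\left\|X - P_{L_j}XP_{L_j}\right\| \to 0 \quad \text{almost surely}.$$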



4. Application: estimation of the integral operators. 

Let $(S, \Pi)$ be a measurable space, with $\Pi$ being a probability measure. Let $K(\cdot, \cdot)$ be a symmetric
continuous positive definite kernel with $\kappa := \sup_{x \in S} |K(x, x)| < \infty$ and let $\mathcal{H}_K$ be the corresponding
reproducing kernel Hilbert space. For $x \in S$, let $K_x(\cdot) := K(\cdot, x)$.
Define the integral operator $L_K : \mathcal{H}_K \to \mathcal{H}_K$ by

$$(L_K f)(x) := \int_S K(x, y) f(y)\,d\Pi(y) = \int_S K(x, y)\,\langle K_y, f\rangle_{\mathcal{H}_K}\,d\Pi(y),$$

where the second equality follows from the reproducing property. Note that $L_K$ is self-adjoint and
trace-class, with $\mathrm{tr}\,L_K = \mathbb{E}K(X, X)$. In many problems, $\Pi$ is unknown and $L_K$ is approximated
by its empirical version $L_{K,n}$:

$$(L_{K,n} f)(x) := \frac{1}{n}\sum_{i=1}^n \langle f, K_{X_i}\rangle_{\mathcal{H}_K}\,K_{X_i}(x),$$

where $X_1, \ldots, X_n$ is an iid sample from $\Pi$. The natural question to ask is: what is the degree of
approximation provided by $L_{K,n}$ measured in the operator norm? Theorem 2.1 gives an answer
to this question. To apply the theorem, define the operator-valued random variables

$$\xi_i := \langle \cdot, K_{X_i}\rangle K_{X_i} - L_K.$$



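Note that $\xi_i$'s are iid with mean zero. Setting $u_i = \frac{K_{X_i}}{\|K_{X_i}\|_{\mathcal{H}_K}}$, we have

$$\left\|\langle \cdot, K_{X_i}\rangle K_{X_i}\right\| = \|K_{X_i}\|_{\mathcal{H}_K}^2\,\left\|\langle \cdot, u_i\rangle u_i\right\| = K(X_i, X_i) \le \kappa,$$

hence $\|\xi_i\| \le 2\kappa$. At the same time, since $U_i := \langle \cdot, u_i\rangle u_i$ is a projector, it satisfies $U_i^2 = U_i$ and

$$\left\|\mathbb{E}\left(\langle \cdot, K_{X_i}\rangle K_{X_i}\right)^2\right\| = \left\|\mathbb{E}\left(\|K_{X_i}\|_{\mathcal{H}_K}^2\,\langle \cdot, K_{X_i}\rangle K_{X_i}\right)\right\| \le \kappa\,\mathbb{E}\|K_{X_i}\|_{\mathcal{H}_K}^2 = \kappa\,\mathbb{E}K(X, X).$$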

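Note that in many cases $\mathbb{E}K(X, X)$ is much smaller than $\kappa$. Applying theorem 2.1, we get
Corollary 4.1. Under our assumptions on the kernel,

$$\mathbb{P}\left(\|L_{K,n} - L_K\| > t\right) \le 2\,\frac{\mathrm{tr}\,\mathbb{E}\xi_1^2}{\left\|\mathbb{E}\xi_1^2\right\|}\,\exp\left(-\frac{n t^2}{2\kappa\left(\mathbb{E}K(X, X) + 2t/3\right)}\right)\left(1 + \gamma(t)\right),$$

where the term $\gamma(t)$ comes from the factor $r_\sigma(t)$ of theorem 2.1.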



This can be used together with the fact that

$$\|L_{K,n} - L_K\| \ge \sup_j \left|\lambda_j(L_K) - \lambda_j(L_{K,n})\right|,$$

where the eigenvalues of $L_K$ and $L_{K,n}$ are ordered increasingly. In particular, in many cases our
bound improves upon the estimate of Proposition 1 in [SZ09].
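To make the quantity $\|L_{K,n} - L_K\|$ concrete, consider a hypothetical special case that is not discussed above: the linear kernel $K(x, y) = \langle x, y\rangle$ on the unit circle in $\mathbb{R}^2$ with $\Pi$ uniform. In this case $\mathcal{H}_K$ can be identified with $\mathbb{R}^2$, $L_K$ with $\mathbb{E}XX^T = I/2$, $L_{K,n}$ with $\frac{1}{n}\sum_i X_iX_i^T$, and $\kappa = \mathbb{E}K(X, X) = 1$. The sketch below simulates the operator-norm error under these assumptions; the function names and the seed are illustrative choices.

```python
import math
import random

def operator_norm_sym2(a, b, c):
    """Operator norm of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    # The eigenvalues are mean + radius and mean - radius.
    return max(abs(mean + radius), abs(mean - radius))

def empirical_error(n, rng):
    """||L_{K,n} - L_K|| for the linear kernel with X uniform on the unit circle."""
    a = b = c = 0.0
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x, y = math.cos(angle), math.sin(angle)
        a += x * x
        b += x * y
        c += y * y
    # Here L_K = E[X X^T] = I / 2, so subtract 1/2 from the diagonal entries.
    return operator_norm_sym2(a / n - 0.5, b / n, c / n - 0.5)

rng = random.Random(0)
for n in (100, 1000, 10000):
    print(n, empirical_error(n, rng))
```

In this toy model the error is expected to shrink roughly like $n^{-1/2}$, which is consistent with the exponent $n t^2$ in Corollary 4.1.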